Apparatus for identifying peptides and proteins by mass spectrometry

ABSTRACT

A method of identifying a protein, polypeptide or peptide by means of mass spectrometry and especially by tandem mass spectrometry is disclosed. The method preferably models the fragmentation of a peptide or protein in a tandem mass spectrometer to facilitate comparison with an experimentally determined spectrum. A fragmentation model is used which takes account of all possible fragmentation pathways which a particular sequence of amino acids may undergo. A peptide or protein may be identified by comparing an experimentally determined mass spectrum with spectra predicted using such a fragmentation model from a library of known peptides or proteins. Alternatively, a de novo method of determining the amino acid sequence of an unknown peptide using such a fragmentation model may be used.

This invention relates to methods of identifying a protein, polypeptide or peptide by means of mass spectrometry and especially by tandem mass spectrometry (MS/MS). Preferred methods relate to the use of mass spectral data to identify an unknown protein whose sequence is at least partially present in an existing database.

Although several well-established chemical methods for the sequencing of peptides, polypeptides and proteins are known (for example, the Edman degradation), mass spectrometric methods are becoming increasingly important in view of their speed and ease of use. Mass spectrometric methods have been developed to the point at which they are capable of sequencing peptides in a mixture without any prior chemical purification or separation, typically using electrospray ionization and tandem mass spectrometry (MS/MS). For example, see Yates III (J. Mass Spectrom, 1998 vol. 33 pp. 1-19), Papayannopoulos (Mass Spectrom. Rev. 1995, vol. 14 pp. 49-73), and Yates III, McCormack, and Eng (Anal. Chem. 1996 vol. 68 (17) pp. 534A-540A). Thus, in a typical MS/MS sequencing experiment, molecular ions of a particular peptide are selected by the first mass analyzer and fragmented by collisions with neutral gas molecules in a collision cell. The second mass analyzer is then used to record the fragment ion spectrum that generally contains enough information to allow at least a partial, and often the complete, sequence to be determined.

Unfortunately, however, the interpretation of the fragment spectra is not straightforward. Manual interpretation (see, for example, Hunt, Yates III, et al, Proc. Nat. Acad. Sci. USA, 1986, vol. 83 pp 6233-6237 and Papayannopoulos, ibid) requires considerable experience and is time consuming. Consequently, many workers have developed algorithms and computer programs to automate the process, at least in part. The nature of the problem, however, is such that none of those so far developed is able to provide complete sequence information in reasonable time without either requiring some prior knowledge of the chemical structure of the peptide or merely identifying likely candidate sequences in existing protein structure databases. The reason for this will be understood from the following discussion of the nature of the fragment spectra produced.

Typically, the fragment spectrum of a peptide comprises peaks belonging to about half a dozen different ion series, each of which corresponds to a different mode of fragmentation of the peptide parent ion. Each typically (but not invariably) comprises peaks representing the loss of successive amino acid residues from the original peptide ion. Because all but two of the 20 amino acids from which most naturally occurring proteins are comprised have different masses, it is therefore possible to establish the sequence of amino acids from the difference in mass of peaks in any given series which correspond to the successive loss of an amino acid residue from the original peptide. However, difficulties arise in identifying to which series an ion belongs and from a variety of ambiguities that can arise in assigning the peaks, particularly when certain peaks are either missing or unrecognized. Moreover, other peaks are typically present in a spectrum due to various more complicated fragmentation or rearrangement routes, so that direct assignment of ions is fraught with difficulty. Further, electrospray ionization tends to produce multiply charged ions that appear at correspondingly rescaled masses, which further complicates the interpretation of the spectra. Isotopic clusters also lead to proliferation of peaks in the observed spectra. Thus, the direct transformation of a mass spectrum to a sequence is only possible in trivially small peptides.
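By way of illustration only, the reading of residues from the mass differences between successive peaks of a single ion series may be sketched as follows. The function name `read_ladder`, the tolerance of 0.02 daltons and the restriction to a subset of monoisotopic residue masses are assumptions made for this sketch and form no part of the method described.

```python
# Monoisotopic residue masses in daltons (illustrative subset only).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203,
    "L": 113.08406, "I": 113.08406,   # leucine and isoleucine are isobaric
    "Q": 128.05858, "K": 128.09496,   # close in mass, but distinguishable
}

def read_ladder(peaks, tol=0.02):
    """Label each gap between successive peaks of one ion series.

    peaks: ascending masses assumed to belong to the same series.
    Returns one label per gap: the matching residue(s), or "?" if none match.
    """
    labels = []
    for lower, upper in zip(peaks, peaks[1:]):
        diff = upper - lower
        matches = [aa for aa, m in RESIDUE_MASS.items() if abs(m - diff) <= tol]
        labels.append("/".join(matches) if matches else "?")
    return labels

if __name__ == "__main__":
    print(read_ladder([100.0, 157.02146, 228.05857, 341.14263]))
```

A gap that matches both leucine and isoleucine is reported as ambiguous, and a gap that matches no residue is reported as "?", which reflects the difficulty noted above when peaks are missing or belong to another series.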

The reverse route, transforming trial sequences to predicted spectra for comparison with the observed spectrum, should be easier, but has not been fully developed. The number of possible sequences for any peptide (20ⁿ, where n is the number of amino acids comprised in the peptide) is very large, so the difficulty of finding the correct sequence for, say, a peptide of a mere 10 amino acids (20¹⁰ ≈ 10¹³ possible sequences) will be appreciated. The number of potential sequences increases very rapidly both with the size of the peptide and with the number (at least 20) of the residues being considered.
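The growth in the number of trial sequences may be checked with a short calculation, given here as an illustrative sketch only; the function name `count_sequences` is an assumption.

```python
def count_sequences(n_residues, alphabet_size=20):
    """Number of distinct linear sequences of n_residues drawn from the alphabet."""
    return alphabet_size ** n_residues

if __name__ == "__main__":
    for n in (5, 10, 15):
        print(n, count_sequences(n))
```

For n = 10 the count is 10 240 000 000 000, that is, approximately 10¹³.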

Details of the first computer programs for predicting probable amino acid sequences from mass spectral data appeared in 1984 (Sakurai, Matsuo, Matsuda, Katakuse, Biomed. Mass Spectrom, 1984, vol. 11 (8) pp 397-399). This program (PAAS3) searched through all the amino acid sequences whose molecular weights coincided with that of the peptide being examined and identified the sequences most consistent with the experimentally observed spectra. Hamm, Wilson and Harvan (CABIOS, 1986 vol. 2 (2) pp 115-118) also developed a similar program.

However, as pointed out by Ishikawa and Niwa (Biomed. and Environ. Mass Spectrom. 1986, vol. 13 pp 373-380), this approach is limited to peptides not exceeding 800 daltons in view of the computer time required to carry out the search. Parekh et al in UK patent application 2,325,465 (published November 1998) have resurrected this idea and give an example of the sequencing of a peptide of 1000 daltons which required 2×10⁶ possible sequences to be searched, but do not specify the computer time required. Nevertheless, despite the increase in the processing speed of computers between 1984 and 1999, a simple search of all possible sequences for a peptide of molecular weight greater than 1200 daltons is still impractical in a reasonable time using the personal computer typically supplied for data processing with most commercial mass spectrometers.

This problem has long been recognized and several approaches to rendering the problem more tractable have been described. One of the most successful has been to correlate the mass spectral data with the known amino acid sequences comprised in a protein database rather than with every possible sequence. In the prior method known as peptide mass mapping, a protein may be identified by merely determining the molecular weights of the peptides produced by digesting it with a site-specific protease and comparing the molecular weights with those predicted from known proteins in a database. (See, for example, Yates, Speicher, et al in Analytical Biochemistry, 1993 vol 214 pp 397-408). However, mass mapping is ineffective if a protein or peptide comprises only a small number of amino acid residues or possible fragments, and is inapplicable if information about the actual amino acid sequences is required. As explained, tandem mass spectrometry (MS/MS) can be used to provide such sequence information. MS/MS spectra usually contain enough detail to allow a peptide to be at least partially, and often completely, sequenced without reference to any database of known sequences (see copending application GB 9907810.7, filed 6 Apr. 1999). There are, however, many circumstances where it is adequate, or even preferred, to establish sequences by reference to an existing database. Such methods were pioneered by Yates, et al, see, for example, PCT application 95/25281, Yates (J. Mass Spectrom 1998 vol 33 pp 1-19), Yates, Eng et al (Anal. Chem. 1995 vol 67 pp 1426-33). Other workers, including Mørtz et al (Proc. Nat Acad. Sci. USA, 1996 vol 93 pp 8264-7), Figeys, et al (Rapid Commun. Mass Spectrom. 1998 vol 12 pp 1435-44), Jaffe, et al, (Biochemistry, 1998 vol 37 pp 16211-24), Amot et al (Electrophoresis, 1998 vol 19 pp 968-980) and Shevchenko et al (J. Protein Chem. 1997 vol 16 (5) pp 481-490) report similar approaches.

As explained, it is generally easier to predict a fragmentation mass spectrum from a given amino acid sequence than to carry out the reverse procedure when comparing experimental MS data with sequence databases. A “fragmentation model” that describes the various ways in which a given amino acid sequence may fragment is therefore required. The chemical processes which result in fragmentation are fairly well understood, but because the number of possible routes increases very rapidly with the number of amino acid residues in a sequence it is difficult to build this knowledge into a definite model. The fragmentation models so far proposed (for example Eng et al, J. Am. Soc. Mass Spectrom, 1994 vol 5 pp 976-89) typically incorporate only a small number of possible fragmentation routes and typically produce a predicted spectrum in which all the mass peaks have equal probability. This constrained approach compromises the accuracy of the comparison with an experimental spectrum, which is likely to represent the sum of many different fragmentation pathways operating simultaneously with different degrees of importance. Consequently the degree of confidence that can be placed in the identification of a sequence on the basis of the prior fragmentation models is reduced and the chance of an incorrect identification is increased.

As explained in our copending application (GB 9907810.7, filed 6 Apr. 1999) a realistic fragmentation model is also required to predict spectra from pseudo-randomly generated trial sequences (as opposed to existing sequences comprised in a database). The fragmentation models described in the present application are applicable to both approaches.

It is an object of the present invention to provide an improved method of modelling the fragmentation of a peptide or protein in a tandem mass spectrometer to facilitate comparison with an experimentally determined spectrum. It is another object of the invention to provide such a fragmentation model which takes account of all possible fragmentation pathways which a particular sequence of amino acids may undergo. A further object of the invention is to provide methods of identifying a peptide or protein by comparing an experimentally determined mass spectrum with spectra predicted using such a fragmentation model from a library of known peptides or proteins. It is yet another object of the invention to provide a de novo method of determining the amino acid sequence of an unknown peptide using such a fragmentation model.

In accordance with these objectives the invention provides a method of identifying the most probable amino acid sequences which would account for the mass spectrum of a protein or peptide, said method comprising the steps of:

-   a) producing a processable mass spectrum from said peptide; and
-   b) using a fragmentation model to calculate the likelihood that any given trial amino acid sequence would account for said processable spectrum, said fragmentation model comprising the step of summing probabilistically a plurality of fragmentation routes which together represent the possible ways that said trial sequence might fragment in accordance with a set of predefined rules, each said fragmentation route being assigned a prior probability appropriate to the chemical processes involved.

In preferred methods, said plurality of fragmentation routes represent all the possible ways that a said trial sequence might fragment.

Preferably the fragmentation model is based on the production of at least two series of ions, the b series (which comprises ions representing the N-terminal residue of the trial sequence and the loss of C-terminal amino acid residues), and the y″ series (which comprises ions representing the C-terminal residue and the loss of N-terminal amino acid residues). Each family of ions behaves as a coherent series, with neighbouring ions likely to be either both present or both absent. This behaviour may be described by a Markov chain, in which the probability of an ion being observed is influenced by whether or not its predecessor was observed. The parameters of the chain may be adjusted to take account of the proton affinities of the residues and their physical bond strengths. The fragmentation model may be refined by including other ion series, particularly the a series (b ions which have lost CO), the z″ series (y″ ions which have lost NH₃), and the more general loss of NH₃ or H₂O, again taking account of the probability of the chemical processes involved. Immonium ions equivalent to the loss of CO and H from the various amino acid residues may also be included. Further, the fragmentation model may comprise the generation of sub-sequences of amino acids, that is, sequences that begin and end at amino acid residues internal to the unknown peptide. It will be appreciated that the more realistic the fragmentation model is, the better will be the accuracy and fidelity of the computation of the most probable sequences. It is therefore envisaged that different fragmentation models may be employed if advances are made in understanding the chemical mechanism by which the mass spectrum of the peptide is produced.

Each of the chemical processes described above may be assigned a prior probability on the basis of the physical strength of the bonds broken in the proposed fragmentation step and the proton affinities of the various amino acid residues, thereby enabling the prior probability of each complete fragmentation route to be calculated. However, using Markov chains to model each of the ion series produced (e.g., the b or y″ series) means that it is unnecessary to compute an explicit spectrum for every possible fragmentation route for comparison with the processable spectrum. Instead, the method of the invention arrives at the same result by using the Markov chain representation of the various ion series to factorize the comparison, so that the likelihood summed over all the fragmentation routes can be computed in polynomial time (in the most preferred embodiment, linear time). This summed likelihood is a better basis for comparison with the processable spectrum than the likelihood or other score derived from a single fragmentation route, such as would be produced by prior fragmentation models, because the fragmentation of a real peptide involves many simultaneous routes. By the use of a fully probabilistic fragmentation model, therefore, the method of the invention automatically accounts, in a quantitative sense, for this multiplicity of routes.

As explained, using Markov chains to model the fragmentation process allows the sum over all the possible fragmentation patterns to be calculated in linear time (i.e., in a time proportional to the number of amino acid residues in the peptide) rather than in a time proportional to the exponentially large number of fragmentation patterns themselves. However, it will be appreciated that the invention is not limited to the particular fragmentation model described above, but includes any probabilistic fragmentation model that can be integrated computationally in polynomial time.

It will be appreciated that trial sequences used in the method of the invention may be obtained from one or more libraries or databases containing sequences or partial sequences of known peptides and proteins, or may be generated pseudo-randomly in a de novo sequencing method, as described in our co-pending patent application (GB 9907810.7, filed 6 Apr. 1999). For example, a fragmentation model according to the invention may be used to calculate the likelihood of amino acid sequences comprised in an existing protein or peptide database accounting for an experimentally observed mass spectrum of a peptide. In this way the peptide, and/or the protein from which it is derived, may be identified. Conveniently, in such a method, only sequences or partial sequences having a molecular weight in a given range are selected from the database for input to the fragmentation model.
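The selection of partial sequences having a molecular weight in a given range may be sketched as follows. This is an illustrative sketch only: the function names `peptide_mass` and `candidates`, the use of monoisotopic masses and the default window of ±0.5 daltons are assumptions, and no protease specificity is applied.

```python
# Monoisotopic residue masses in daltons for the 20 common amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565  # added once per peptide for the terminal H and OH

def peptide_mass(sequence):
    """Monoisotopic mass of an unmodified linear peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER

def candidates(protein, target_mass, tol=0.5):
    """Yield contiguous partial sequences of protein within tol of target_mass."""
    n = len(protein)
    for start in range(n):
        total = WATER
        for end in range(start, n):
            total += RESIDUE_MASS[protein[end]]
            if total > target_mass + tol:
                break  # extending further can only increase the mass
            if abs(total - target_mass) <= tol:
                yield protein[start:end + 1]

if __name__ == "__main__":
    target = peptide_mass("SPV")
    print(list(candidates("GASPVT", target)))
```

Only the partial sequences returned by such a filter need be passed to the fragmentation model, which reduces the number of likelihood calculations required for each database entry.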

The method of the invention assigns a likelihood factor to each trial amino acid sequence considered. The most probable amino acid sequences in the database (or pseudo-randomly generated sequences) which would account for the processable spectrum may then be identified as the trial sequences with the highest likelihood factors. However, a more precise method, which is particularly appropriate in the case of de novo sequencing, is to use a Bayesian approach. Each trial sequence is assigned a prior probability on the basis of whatever information is known about it, including its relationship to the sample from which the processable spectrum is obtained. For example, in true de novo sequencing the prior probability of a trial sequence may be based on the average natural abundances of the amino acid residues it comprises. In the case of database searches, it may be known, for example, that the sample is derived from a yeast protein, in which case sequences in the database derived from yeasts may be assigned a higher prior probability.

The probability of a trial sequence accounting for the processable spectrum is then calculated by Bayes' theorem, that is:

Probability (trial sequence AND processable spectrum) = Prior probability (trial sequence) × likelihood factor

In Bayesian terminology, the likelihood factor is:

-   -   Probability (processable spectrum GIVEN trial sequence).
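The Bayesian combination of prior probability and likelihood factor may be sketched as follows. This is an illustrative sketch only: the function name `posterior` and the normalization over the set of trial sequences actually considered are assumptions, since the text above defines only the joint probability.

```python
def posterior(priors, likelihoods):
    """Combine priors and likelihood factors for a set of trial sequences.

    priors, likelihoods: dicts keyed by trial sequence.
    Returns the joint probabilities normalized over the sequences considered.
    """
    joint = {seq: priors[seq] * likelihoods[seq] for seq in priors}
    evidence = sum(joint.values())
    if evidence == 0.0:
        raise ValueError("no trial sequence has non-zero joint probability")
    return {seq: value / evidence for seq, value in joint.items()}

if __name__ == "__main__":
    # Hypothetical numbers: two trial sequences with equal priors.
    print(posterior({"SEQ1": 0.5, "SEQ2": 0.5}, {"SEQ1": 0.2, "SEQ2": 0.6}))
```

With equal priors the ranking is determined by the likelihood factors alone; raising the prior of, say, yeast-derived sequences shifts the ranking accordingly.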

Although in certain simple cases the processable mass spectrum may simply be the observed mass spectrum, it is generally preferable to convert the observed spectrum into a more suitable form before attempting to sequence the peptide. Preferably, the processable spectrum is obtained by converting multiply-charged ions and isotopic clusters of ions to a single intensity value at the mass-to-charge ratio corresponding to a singly-charged ion of the lowest mass isotope, and calculating an uncertainty value for the actual mass and the probability that a peak at that mass-to-charge ratio has actually been observed. Conveniently, the uncertainty value of a peak may be based on the standard deviation of a Gaussian peak representing the processed peak and the probability that a peak is actually observed may be related to the signal-to-noise ratio of the peak in the observed spectrum. The program “MaxEnt3™” available from Micromass UK Ltd. may be used to produce the processable spectrum from an observed spectrum.

In order to carry out the methods of the invention a sample comprising one or more unknown peptides may be introduced into a tandem mass spectrometer and ionized using electrospray ionization. The molecular weights of the unknown peptides may typically be determined by observing the molecular ion groups of peaks in a mass spectrum of the sample. The first analyzer of the tandem mass spectrometer may then be set to transmit the molecular ion group of peaks corresponding to one of the unknown peptides to a collision cell, in which the molecular ions are fragmented by collision with neutral gas molecules. The second mass analyzer of the tandem mass spectrometer may then be used to record an observed fragmentation mass spectrum of the peptide. A processable mass spectrum may then be derived from the observed spectrum using suitable computer software, as explained. If the sample comprises a mixture of peptides, for example as might be produced by a tryptic digest of a protein, further peptides may be analyzed by selecting the appropriate molecular ion group using the first mass analyzer.

Viewed from another aspect the invention provides apparatus for identifying the most likely sequences of amino acids in an unknown peptide, said apparatus comprising a mass spectrometer for generating a mass spectrum of a said unknown peptide and data processing means programmed to:

-   a) Process data generated by said mass spectrometer to produce a processable mass spectrum; and
-   b) Calculate the likelihood that any given trial amino-acid sequence would account for said processable spectrum using a fragmentation model which sums probabilistically over a plurality of fragmentation routes which together represent the possible ways that said trial sequence might fragment in accordance with a set of predefined rules, each said fragmentation route being assigned a prior probability appropriate to the chemical processes involved.

In preferred embodiments, apparatus according to the invention comprises a tandem mass spectrometer, and most preferably a tandem mass spectrometer that comprises a Time-of-Flight mass analyzer at least as its final stage. A Time-of-Flight mass analyzer is preferred because it is generally capable of greater mass measurement accuracy than a quadrupole analyzer. Preferably also the mass spectrometer comprises an electrospray ionization source into which an unknown peptide sample may be introduced.

A preferred method of the invention will now be described in greater detail by reference to the figures, wherein:

FIG. 1 is a schematic drawing of a tandem TOF mass spectrometer suitable for generating a mass spectrum from an unknown peptide sample; and

FIG. 2 is a flow chart representing the operation of a method according to the invention.

Referring first to FIG. 1, the principal components of a tandem time-of-flight mass spectrometer suitable for carrying out methods according to the invention are shown in schematic form. An unknown peptide sample, or a mixture of such samples, is introduced into a capillary 17 comprised in an electrospray ion source generally indicated by 1. A jet 18 comprising ions characteristic of said peptide is generated in the source 1, and at least some of these ions pass through an aperture in a sampling cone 2 into a first evacuated chamber 3. From the chamber 3 the ions pass through an aperture in a skimmer cone 4 into a second evacuated chamber 5, and are then transported by means of a hexapole ion guide 6 into a quadrupole mass analyzer 7 disposed in a third evacuated chamber 8.

In a spectrometer of the kind illustrated in FIG. 1, the molecular weight of the peptide may be determined by using the mass analyzer 7 in a non mass-selective mode while a mass spectrum of the sample is acquired. Preferably, the molecular weight is determined to within ±0.5 daltons.

In order to record a fragmentation spectrum of an unknown peptide, the mass analyzer 7 may be set to transmit only the molecular ions of the unknown peptide (or a selected one of several peptides, if more than one is present in the sample). Molecular ions of the unknown peptide then pass from the mass analyzer 7 into a hexapole collision cell 9 which contains a collision gas (typically helium or argon) at a pressure between 10⁻³ and 10⁻² torr and are fragmented to produce fragment ions which are indicative of the sequence of the unknown peptide. Typically, these fragment ions include ions formed by various losses of the amino acid residues from both the C and N termini of the peptide molecule, as discussed in more detail below.

The fragment ions formed in the collision cell 9 pass into a time-of-flight mass analyzer generally indicated by 10 via an electrostatic lens 11. In the time-of-flight analyzer 10, the ions are received by an ion-pusher 12 which causes bunches of ions to travel through a drift region 13 from the pusher to an ion-reflector 14, then back to an ion detector 15, as shown in FIG. 1. The mass of the ions is then determined by measuring the time taken for them to reach the detector 15 relative to the time they were ejected from the ion-pusher 12. A data acquisition system 16 controls this process and is programmed to carry out a method of the invention as discussed below. The mass range of the entire spectrometer should be at least 2500 daltons and it should preferably be capable of determining the masses of the fragment ions to at least ±0.5, and preferably ±0.05 daltons. A suitable mass spectrometer is obtainable from Micromass UK Ltd as the “Q-Tof”.

Referring next to FIG. 2, a preferred method according to the invention begins by acquiring a fragmentation mass spectrum of the unknown peptide using the tandem mass spectrometer of FIG. 1.

The fragmentation spectrum is in practice complicated by the occurrence of multiply-charged ions and isotopic clusters (that is, several peaks associated with a single ion of a particular nominal mass consequent upon the natural abundance of different carbon, hydrogen, oxygen, nitrogen, and sulphur isotopes comprised in the ion). The method is therefore facilitated by conversion of the raw fragmentation spectrum to a “processable” spectrum. In such a spectrum, the multiply-charged ions may be converted to a corresponding singly charged ion at the appropriate nominal mass and the minor peaks comprised in each isotopic cluster are subsumed into the main peak representing the parent isotopic variant (i.e. that comprising ¹²C, ¹⁶O, ¹⁴N, ¹H, ³²S). The program “MaxEnt3™” available from Micromass UK Ltd. may be used for this purpose, but other software capable of these operations may be employed.
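The conversion of a multiply-charged ion to the corresponding singly-charged mass may be sketched as follows. This is an illustrative sketch only and does not represent the operation of the “MaxEnt3™” program: the function name `to_singly_charged` is an assumption, as is the premise that each charge is carried by an added proton, which is usual for positive-ion electrospray.

```python
PROTON = 1.007276  # mass of a proton in daltons

def to_singly_charged(mz, charge):
    """Convert the m/z of an [M + zH]^z+ ion to the m/z of [M + H]^+.

    mz: observed mass-to-charge ratio of the monoisotopic peak.
    charge: number of protons carried by the ion (z >= 1).
    """
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    neutral_mass = charge * (mz - PROTON)  # remove all z protons
    return neutral_mass + PROTON           # add one proton back

if __name__ == "__main__":
    print(to_singly_charged(500.0, 2))
```

A doubly-charged peak at m/z 500.0 thus maps to a singly-charged value of about 998.9927, so that peaks of all charge states may be compared on a single mass scale.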

It is also preferable to represent each peak in the processable mass spectrum as a single nominal mass value together with an uncertainty value, for example 512.30±0.05 daltons, rather than as a series of real data points forming an approximately Gaussian peak as it would appear in the raw spectrum. The program “MaxEnt3™” also carries out this conversion, but any suitable peak recognition software could be employed. However, it has been found that the fidelity of the final most probable sequences predicted by methods according to the invention is strongly dependent on the range of the masses assigned to the constituent peaks in the processable mass spectrum. Consequently, both the calibration of the mass scale of the tandem mass spectrometer and the conversion of the raw peaks to their nominal masses and their uncertainties must be carried out carefully and rigorously. It has been found that the intensities of the peaks in the fragmentation spectrum have little value in predicting the sequence of an unknown peptide. Instead of intensities, therefore, the peak recognition software should calculate a probability that each peak actually has been detected in the fragmentation spectrum, rather than being due to noise or an interfering background. The program “MaxEnt3™” is also capable of this operation.

Once a processable spectrum has been produced from the sample protein or peptide, trial sequences may be generated pseudo-randomly in the case of a de novo sequencing method (see, for example, copending patent application GB 9907810.7, filed 6 Apr. 1999) or randomly or pseudo-randomly selected from a library or database of protein sequences. Typically, these randomly generated or selected sequences may be constrained by the molecular weight of the peptide when that has been determined. In the case of sequences comprised in a database, partial sequences having the requisite molecular weight may be extracted from longer sequences in the database. According to the invention, the likelihood of each trial sequence accounting for the processable spectrum is calculated using a fragmentation model which sums probabilistically over all the ways in which a trial sequence might fragment and give rise to peaks in the processable mass spectrum. This model should incorporate as much chemical knowledge concerning the fragmentation of peptides in the tandem mass spectrometer as is available at the time it is constructed. A preferred model incorporates the production of the following series of ions:

-   a) The b series (ions representing the N-terminal amino acid residues and the loss of C-terminal amino acid residues);
-   b) The y″ series (ions representing the C-terminal amino acid residues and the loss of N-terminal amino acid residues);
-   c) The a series (b ions which have lost CO);
-   d) The z″ series (y″ ions which have lost NH₃); and
-   e) More general loss of NH₃ or H₂O.
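The masses at which the ions of series a) to d) above would be expected for a given trial sequence may be sketched as follows. This is an illustrative sketch only: the function name `ion_series`, the restriction to a subset of residues and the assumption of singly-protonated, unmodified fragments are not taken from the description, and the z″ series is computed as defined above, namely as y″ ions which have lost NH₃.

```python
from itertools import accumulate

# Monoisotopic masses in daltons (residue table is an illustrative subset).
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
                "P": 97.05276, "V": 99.06841, "L": 113.08406}
PROTON = 1.007276
WATER = 18.010565
CO = 27.994915
NH3 = 17.026549

def ion_series(sequence):
    """Return singly-charged m/z values for the b, y'', a and z'' series.

    Each list holds fragments of 1 to n-1 residues; the keys "y" and "z"
    stand for the y'' and z'' series of the text.
    """
    masses = [RESIDUE_MASS[aa] for aa in sequence]
    prefix = list(accumulate(masses))[:-1]            # N-terminal fragments
    suffix = list(accumulate(reversed(masses)))[:-1]  # C-terminal fragments
    b = [m + PROTON for m in prefix]
    y = [m + WATER + PROTON for m in suffix]
    a = [m - CO for m in b]
    z = [m - NH3 for m in y]
    return {"b": b, "y": y, "a": a, "z": z}

if __name__ == "__main__":
    for name, values in ion_series("GAS").items():
        print(name, [round(v, 4) for v in values])
```

A useful check on such a calculation is that each b ion and its complementary y″ ion together account for the whole peptide plus two protons.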

The two main series of ions (y″ and b) are represented in the preferred fragmentation model by Markov chains, one for each series. In each chain, the probability that a particular ion is observed is dependent on the probability of its predecessor. For example, principally because of charge location, the observed y″ ions in a fragmentation spectrum tend to form a coherent series starting with y₁ and usually continuing for some way with y₂, y₃ . . . , perhaps fading out for a time but likely appearing again towards y_(n-1) and finally the full molecule. A Markov chain models this behaviour by setting up the probability (P) of y ions being present as a recurrence relation:

$$P(y_{1}) = p_{1}$$

$$P(y_{r}) = p_{r}\,P(y_{r-1}) + q_{r}\left(1 - P(y_{r-1})\right)$$

for r = 2, 3, 4, . . . , n, where P(y_(r)) is the probability of y_(r) being present and the probability of y_(r) being absent is 1−P(y_(r)). The coefficients p and q are transition probabilities that determine how likely the series is to begin, to end, and to (re-)start. A similar Markov chain may be set up to represent the b ions.
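The recurrence relation for the y series may be evaluated as sketched below. This is an illustrative sketch only: the function name `presence_probabilities`, the list layout (index 0 holding the values for r = 1, with the first entry of `q` unused) and the numerical transition probabilities are assumptions.

```python
def presence_probabilities(p, q):
    """Evaluate P(y_1), ..., P(y_n) from the transition probabilities.

    p[r-1]: probability that y_r is present given y_(r-1) present
            (p[0] is the starting probability p_1).
    q[r-1]: probability that y_r is present given y_(r-1) absent
            (q[0] is unused).
    """
    probs = [p[0]]
    for p_r, q_r in zip(p[1:], q[1:]):
        prev = probs[-1]
        probs.append(p_r * prev + q_r * (1.0 - prev))
    return probs

if __name__ == "__main__":
    # Hypothetical values favouring a coherent series (p > 1/2, q < 1/2).
    print(presence_probabilities([0.9, 0.8, 0.8], [0.0, 0.2, 0.2]))
```

With these hypothetical values the probabilities are 0.9, 0.74 and 0.644, showing how a series that starts strongly fades gradually rather than abruptly.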

This is illustrated in FIG. 2 (in which “˜” represents “not present”). Here the y″ series starts with y₁″, which has probability p₁ of being present and hence probability 1−p₁ of not being present. Similarly the b series starts with b₁, which has its probability p₁ of being present. The numerical values of these and other probabilities depend on the chemistry involved: in fact p₁ for the b series can be set at or near zero, to incorporate the observation that the b₁ ion is usually absent. If the y₁ ion is present, it induces y₂ with probability p₂, and if not, y₂ is induced with probability q₂, as shown on the right-ward arrows in FIG. 2. The fact that presence of y₂ would usually follow from presence of y₁, and conversely, is coded by setting $p_{2} > \frac{1}{2}$ and $q_{2} < \frac{1}{2}$. This correlated structure, known as a Markov chain, is continued from y₂ to y₃ and similarly up to y_(n). Another such chain defines the b series. Note that all combinatoric patterns of presence or absence occur in the model, although the transition probabilities are usually assigned so as to favour correlated presences and absences. Transition probabilities can be set according to the charge affinity of the residues, allied to physical bond strengths. For example, a y series is likely to be present at and after a proline residue, so that p_(r) and q_(r) would be assigned higher values if the residue r were proline than if it were another residue.

The primary Markov chains are supplemented by introducing probabilities that the b series ions may also suffer loss of CO to form ions in the a series, that y″ series ions can lose NH₃ to form z″ series ions, and that there may be more general loss of NH₃ or H₂O. Each possible process is assigned a probability which depends on the chemistry involved. For example, the probability of water loss increases with the number of hydroxyl groups on the fragment's side chains and would be zero if there are no such hydroxyl groups that could be lost. The fragmentation model also allows for the formation of internal sequences starting at any residue, according to a probability appropriate for that particular residue. Internal sequences are often observed starting at proline residues, so the probability of one starting at a proline residue is set high. FIG. 2 also illustrates these extensions.

The formation of immonium ions (which are equivalent to the loss of CO and H from a single residue) is also incorporated in the fragmentation model. Only certain residues can generate these ions, and for those that do, appropriate probabilities are set. For example, histidine residues generally result in the formation of an immonium ion at mass 110.072 daltons, and the probability of this process is therefore set close to 100%.

It will be appreciated that the more realistic the fragmentation model is, the faster and more faithful the inference of the sequence of the unknown peptide will be. Consequently, as the understanding of the chemical processes involved in the formation of the fragmentation spectra of peptides advances, it is within the scope of the invention to adjust the fragmentation model accordingly.

The fragmentation model is explicitly probabilistic, meaning that it produces a probability distribution over all the ways that a trial sequence might fragment (based on the fragmentation model) rather than a list of possible masses in a predicted spectrum. Thus, the likelihood factor is computed as the sum over all these many fragmentation possibilities, so that the fragmentation pattern for a trial sequence is automatically and individually adapted to the data comprised in the processable spectrum. In terms of probability theory, the likelihood factor of the processable spectrum D, given a particular trial sequence S, is $P(D\ \text{GIVEN}\ S) = \sum\limits_{f} P(D\ \text{GIVEN}\ f)\,P(f\ \text{GIVEN}\ S)$ where $\sum\limits_{f}$ represents the sum over all the permitted fragmentation patterns f, P(D GIVEN f) is the probability of the processable spectrum assuming the particular fragmentation pattern f, and P(f GIVEN S) is the probability of having fragmentation f from the trial sequence S. P(D GIVEN f) is evaluated as the product over all the fragment masses of the probabilities that the individual fragment masses are present in the processable mass spectrum. As explained, this sum can be computed in polynomial time rather than in a time proportional to the exponentially large number of fragmentation patterns themselves.
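The following Python sketch shows one way such a sum can be evaluated without enumerating every pattern, for the simplified case of a single Markov chain. It is an illustration under assumptions rather than the procedure of the specification: `d[i]` stands for the probability that the mass of ion i is present in the processable spectrum, the factor `1 - d[i]` applied to ions absent from a pattern is an assumed convention, and the function names and numerical values are hypothetical. The recursion carries two running totals along the chain, so its cost grows linearly with the number of ions, while the brute-force version visits all 2ⁿ patterns.

```python
from itertools import product

def likelihood_forward(p1, p, q, d):
    """Sum over all patterns of P(D GIVEN f) * P(f GIVEN S), in linear time."""
    # Running totals for "ion i present" and "ion i absent".
    a_pres = p1 * d[0]
    a_abs = (1.0 - p1) * (1.0 - d[0])
    for i in range(1, len(d)):
        to_pres = a_pres * p[i - 1] + a_abs * q[i - 1]
        to_abs = a_pres * (1.0 - p[i - 1]) + a_abs * (1.0 - q[i - 1])
        a_pres, a_abs = to_pres * d[i], to_abs * (1.0 - d[i])
    return a_pres + a_abs

def likelihood_brute_force(p1, p, q, d):
    """The same sum, enumerating every presence/absence pattern explicitly."""
    total = 0.0
    for f in product([True, False], repeat=len(d)):
        prior = p1 if f[0] else 1.0 - p1          # P(f GIVEN S)
        for i in range(1, len(f)):
            t = p[i - 1] if f[i - 1] else q[i - 1]
            prior *= t if f[i] else 1.0 - t
        evidence = 1.0                            # P(D GIVEN f)
        for present, di in zip(f, d):
            evidence *= di if present else 1.0 - di
        total += evidence * prior
    return total

# Illustrative values only.
p1, p, q = 0.6, [0.8, 0.7, 0.9], [0.2, 0.3, 0.1]
d = [0.9, 0.8, 0.1, 0.7]
fast = likelihood_forward(p1, p, q, d)
slow = likelihood_brute_force(p1, p, q, d)
```

The two functions agree to within floating-point rounding, which is the point of the recursion: the exponentially many patterns never need to be listed.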

Further, methods according to the invention calculate not only a meaningful probability figure for any given trial sequence, but also the probability of the assignment of each peak in the processable spectrum to a given amino acid residue loss. This quantifies confidence in the identification of the peptide and indicates the regions in a sequence about which some doubt may exist if a single match of very high probability cannot be achieved.

1. A method of identifying the most probable amino acid sequences which would account for the mass spectrum of a protein or peptide, said method comprising the steps of: (a) producing a processable mass spectrum from said peptide; and (b) using a fragmentation model to calculate the likelihood that any given trial amino acid sequence would account for said processable spectrum, said fragmentation model comprising the step of summing probabilistically a plurality of fragmentation routes which together represent the possible ways that said trial sequence might fragment in accordance with a set of predefined rules, each said fragmentation route being assigned a prior probability appropriate to the chemical processes involved.
2. A method as claimed in claim 1, wherein said plurality of fragmentation routes represent all the possible ways that a said trial sequence might fragment.
3. A method as claimed in claim 2, wherein the sum over all the possible fragmentation routes is calculated in polynomial time.
4. A method as claimed in claim 2 or 3, wherein the sum over all the possible fragmentation routes is calculated in a time proportional to the number of amino acid residues in the peptide.
5. A method as claimed in claim 1, wherein said fragmentation model is described using one or more Markov chains to represent one or more series of ions in which the probability of an ion being observed is influenced by whether or not an adjacent ion in the series was observed.
6. A method as claimed in claim 1, wherein said prior probability takes into account the proton affinities of amino acid residues.
7. A method as claimed in claim 1, wherein said prior probability takes into account the physical strength of the bonds broken in a proposed fragmentation route.
8. A method as claimed in claim 1, wherein said fragmentation model includes the production of at least the b and y″ series of ions, wherein said b series is defined as comprising ions representing the N-terminal amino acid residue of the trial sequence and the loss of the C-terminal amino acid residues and said y″ series is defined as comprising ions representing the C-terminal amino acid residue and the loss of N-terminal amino acid residues.
9. A method as claimed in claim 8, wherein said fragmentation model includes the production of the a series of ions, wherein said a series is defined as comprising b series ions which have lost CO.
10. A method as claimed in claim 8, wherein said fragmentation model includes the production of z″ series ions, wherein said z″ series is defined as comprising y″ series ions which have lost NH₃.
11. A method as claimed in claim 8, wherein said fragmentation model includes the production of ions which have lost NH₃ and/or H₂O.
12. A method as claimed in claim 8, wherein said fragmentation model includes the production of immonium ions equivalent to the loss of CO and H from amino acid residues.
13. A method as claimed in claim 8, wherein said fragmentation model includes the generation of sub-sequences of amino acids which begin and end at amino acid residues internal to the unknown peptide.
14. A method as claimed in claim 1, wherein a said trial sequence is obtained from one or more libraries or databases containing sequences or partial sequences of known peptides and proteins.
15. A method as claimed in claim 1, wherein a said trial sequence is generated pseudo-randomly using a de-novo sequencing method.
16. A method as claimed in claim 1, wherein said fragmentation model is used to calculate the likelihood of amino acid sequences comprised in an existing protein or peptide database accounting for an experimentally observed mass spectrum of a peptide.
17. A method as claimed in claim 16, wherein only sequences or partial sequences having a molecular weight within a given range are selected from said database for input into said fragmentation model.
18. A method as claimed in claim 1, wherein a likelihood factor is assigned to each trial amino acid sequence considered.
19. A method as claimed in claim 1, wherein the probability of a trial amino acid sequence accounting for said processable mass spectrum is calculated using Bayes' theorem wherein a prior probability assigned to a given trial amino acid sequence is multiplied by a likelihood factor which reflects the degree of agreement between a predicted mass spectrum resulting from said given trial amino acid sequence and said processable mass spectrum.
20. A method as claimed in claim 19, wherein said prior probability includes the average natural abundances of amino acid residues.
21. A method as claimed in claim 19, wherein said prior probability is influenced by the presumed genus or origin of said protein or peptide.
22. A method as claimed in claim 1, wherein said processable mass spectrum comprises an observed mass spectrum.
23. A method as claimed in claim 1, wherein said processable mass spectrum is obtained by converting multiply-charged ions and isotopic clusters of ions to a single intensity value at the mass-to-charge ratio corresponding to a singly-charged ion of the lowest mass isotope.
24. A method as claimed in claim 23, further comprising the step of calculating an uncertainty value for the actual mass and the probability that a peak at that mass-to-charge ratio has actually been observed.
25. A method as claimed in claim 24, wherein said uncertainty value is based on the standard deviation of a Gaussian peak representing the processed peak.
26. A method as claimed in claim 24, wherein the probability that a peak is actually observed is based on the signal-to-noise ratio of the peak in the observed spectrum.
27. Apparatus for identifying the most likely sequences of amino acids in an unknown peptide, said apparatus comprising a mass spectrometer for generating a mass spectrum of a said unknown peptide and data processing means programmed to: (a) process data generated by said mass spectrometer to produce a processable mass spectrum; and (b) calculate the likelihood that any given trial amino-acid sequence would account for said processable spectrum using a fragmentation model which sums probabilistically over a plurality of fragmentation routes which together represent the possible ways that said trial sequence might fragment in accordance with a set of predefined rules, each said fragmentation route being assigned a prior probability appropriate to the chemical processes involved.
28. Apparatus as claimed in claim 27, wherein said mass spectrometer comprises a tandem mass spectrometer.
29. Apparatus as claimed in claim 27, wherein said mass spectrometer further comprises a Time of Flight mass analyzer.
30. Apparatus as claimed in claim 27, wherein said mass spectrometer further comprises an electrospray ionization source into which an unknown peptide sample may be introduced.
31. A method of identifying a most probable amino acid sequence(s) which would account for a fragmentation mass spectrum of a protein or peptide, said method comprising the steps of: producing the fragmentation mass spectrum from said protein or peptide; providing a plurality of trial amino acid sequences; using a fragmentation model to calculate a likelihood factor that any given trial amino acid sequence would account for said fragmentation mass spectrum, said fragmentation model comprising the step of summing probabilistically a plurality of fragmentation routes which together represent the possible ways that said trial amino acid sequence might fragment in accordance with a set of predefined rules, each said fragmentation route being assigned a prior probability appropriate to the chemical processes involved; and selecting one or more of said trial amino acid sequences which have the highest likelihood factor(s) as being the most probable amino acid sequence(s) of said protein or peptide.